Udacity DAN Exploratory Data Analysis Red Wine Quality by Quentin THOMAS
========================================================

This report explores a dataset containing chemical compositions and measurements for 1,599 red wines.

Univariate Plots Section

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

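Our dataset consists of 13 variables for 1,599 red wines. Because there seem to be a lot of outliers, I decided to remove the top 1% of values for some variables. In the summary below, the maxima of volatile acidity, citric acid, residual sugar, chlorides, free and total sulfur dioxide, and sulphates are all lower than before.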
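The trimming step can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the code used for this report: the function name `trim_top_percent` is hypothetical, the rows are assumed to be dictionaries keyed by column name, and the choice of columns to trim is left to the caller because the report does not list them.

```python
from statistics import quantiles


def trim_top_percent(rows, columns, percent=1):
    """Keep only rows at or below the (100 - percent)th percentile in every listed column.

    `percent` is assumed to be an integer between 1 and 99.
    """
    cutoffs = {}
    for col in columns:
        values = [row[col] for row in rows]
        # quantiles(..., n=100) returns 99 cut points; index 98 is the 99th percentile
        cutoffs[col] = quantiles(values, n=100, method="inclusive")[99 - percent]
    return [row for row in rows if all(row[col] <= cutoffs[col] for col in columns)]
```

Trimming several columns at once removes a row as soon as one of its values is above the cutoff, so the minima and maxima of columns that were not trimmed can also change.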

##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1.0   Min.   : 4.600   Min.   :0.1200   Min.   :0.0000  
##  1st Qu.: 411.2   1st Qu.: 7.100   1st Qu.:0.3900   1st Qu.:0.0900  
##  Median : 810.5   Median : 7.900   Median :0.5200   Median :0.2500  
##  Mean   : 806.1   Mean   : 8.329   Mean   :0.5202   Mean   :0.2677  
##  3rd Qu.:1199.8   3rd Qu.: 9.200   3rd Qu.:0.6300   3rd Qu.:0.4200  
##  Max.   :1599.0   Max.   :15.900   Max.   :1.0100   Max.   :0.7900  
##  residual.sugar    chlorides       free.sulfur.dioxide
##  Min.   :0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.:1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median :2.200   Median :0.07900   Median :13.00      
##  Mean   :2.426   Mean   :0.08285   Mean   :15.13      
##  3rd Qu.:2.600   3rd Qu.:0.08900   3rd Qu.:21.00      
##  Max.   :8.300   Max.   :0.35800   Max.   :47.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.860   Min.   :0.3300  
##  1st Qu.: 21.00       1st Qu.:0.9956   1st Qu.:3.220   1st Qu.:0.5500  
##  Median : 36.50       Median :0.9967   Median :3.310   Median :0.6200  
##  Mean   : 43.78       Mean   :0.9967   Mean   :3.316   Mean   :0.6452  
##  3rd Qu.: 59.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7200  
##  Max.   :143.00       Max.   :1.0032   Max.   :4.010   Max.   :1.1600  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.45   Mean   :5.661  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The X variable is only a row index and is not used, so I remove it from the data frame.

According to the French Wikipedia page on sulfur dioxide in oenology (https://fr.wikipedia.org/wiki/Dioxyde_de_soufre_en_œnologie), it is possible to infer the combined sulfur dioxide from the total and free sulfur dioxide values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   12.00   20.00   28.65   37.00  128.00
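The derived variable can be sketched as below, using the definition given later in this report (total sulfur dioxide minus free sulfur dioxide). This is a Python illustration only: the column name `combined.sulfur.dioxide` and the function name are assumptions, and the sample values are the first three rows shown in the structure output above.

```python
def add_combined_sulfur(rows):
    """Return copies of the rows with combined SO2 = total SO2 - free SO2."""
    return [
        {
            **row,
            # hypothetical column name; the report does not show the one it used
            "combined.sulfur.dioxide": row["total.sulfur.dioxide"] - row["free.sulfur.dioxide"],
        }
        for row in rows
    ]


# First three wines from the structure output above
sample = [
    {"free.sulfur.dioxide": 11, "total.sulfur.dioxide": 34},
    {"free.sulfur.dioxide": 25, "total.sulfur.dioxide": 67},
    {"free.sulfur.dioxide": 15, "total.sulfur.dioxide": 54},
]
combined = [row["combined.sulfur.dioxide"] for row in add_combined_sulfur(sample)]
# combined is [23, 42, 39]
```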

Now I can display the spread of these variables.

What is the structure of your dataset?

This data set contains 1,599 red wines with 13 variables: a row index (X), 11 chemical properties of the wine, and a quality rating. It is interesting to notice that even if the quality is rated by experts, it is still a subjective variable.

Most of the variables are numeric; X and quality are integers. Except for pH and density, the variables are mostly skewed to the right.
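One way to put a number on the skew is a moment-based skewness coefficient, which is positive for right-skewed data and close to zero for symmetric data. The sketch below is an assumption about how this could be checked in Python, not part of the original analysis. The sample holds only the first ten residual sugar values from the structure output, so it illustrates the computation rather than the full distribution.

```python
from statistics import fmean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        raise ValueError("skewness is undefined for constant data")
    return fmean([((v - mu) / sigma) ** 3 for v in values])


# First ten residual.sugar values from the structure output above
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]
is_right_skewed = skewness(residual_sugar) > 0
```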

What is/are the main feature(s) of interest in your dataset?

Which chemical properties influence the quality of red wines?

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

According to the associated text file, I think that acidity, citric acid and chlorides should have an impact on the wine quality. I am also curious to see the correlation between sugar and alcohol.

Did you create any new variables from existing variables in the dataset?

Following the French Wikipedia page on sulfur dioxide in oenology (https://fr.wikipedia.org/wiki/Dioxyde_de_soufre_en_œnologie), I have created the combined sulfur dioxide variable, which is equal to the total sulfur dioxide minus the free sulfur dioxide.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I was not sure about the chlorides distribution, so I used a log10 transformation. As the X variable is not used, I have also removed it from the data frame.
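The log10 transformation can be sketched in Python as below. The function name is hypothetical. The check for non-positive values is a precaution added here, since the smallest chlorides value in the summary is 0.012 and the logarithm is therefore defined for every wine.

```python
import math


def log10_transform(values):
    """Apply log10 to a list of strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("log10 is only defined for positive values")
    return [math.log10(v) for v in values]


# First five chlorides values from the structure output above
chlorides_log10 = log10_transform([0.076, 0.098, 0.092, 0.075, 0.076])
```

The transformation compresses the long right tail, which makes the shape of a right-skewed distribution easier to judge on a histogram.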

Bivariate Plots Section

Quality Correlation

Only alcohol, sulphates and volatile acidity have a meaningful correlation with quality. As a reminder, in terms of the absolute value of the correlation coefficient r:

  • small correlation: |r| > 0.3
  • medium correlation: |r| > 0.5
  • strong correlation: |r| > 0.7
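The thresholds above can be applied to a Pearson correlation coefficient as in the following Python sketch. It assumes the thresholds apply to the absolute value of the coefficient, so that a negative correlation is graded the same way as a positive one. The label "negligible" for values at or below 0.3 is an assumption introduced here, and both function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


def strength(r):
    """Grade a correlation coefficient using the thresholds listed above."""
    size = abs(r)
    if size > 0.7:
        return "strong"
    if size > 0.5:
        return "medium"
    if size > 0.3:
        return "small"
    return "negligible"  # label assumed here, not taken from the report
```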

The correlation coefficient is useful, but given the discrete nature of the quality variable, it is better to use boxplots.

We can observe that the quality of wines improves when alcohol and sulphates increase and volatile acidity decreases.
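The comparison that the boxplots make visually can also be sketched numerically by computing the median of a variable for each quality level. The Python function below is a hypothetical helper, not the report's code. The sample uses only the first ten wines from the structure output, so its result illustrates the grouping and says nothing reliable about the full dataset.

```python
from collections import defaultdict
from statistics import median


def median_by_quality(rows, column):
    """Median of `column` for each quality level, ordered by quality."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["quality"]].append(row[column])
    return {quality: median(values) for quality, values in sorted(groups.items())}


# First ten wines from the structure output above
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
sample = [{"alcohol": a, "quality": q} for a, q in zip(alcohol, quality)]
medians = median_by_quality(sample, "alcohol")
# medians is {5: 9.4, 6: 9.8, 7: 9.75}
```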

Now I want to find new correlations for these three variables.

Alcohol Correlation

As correlation with quality has already been studied, I only check the correlation with density.

Alcohol tends to decrease as density increases.

Sulphates Correlation

Interestingly, sulphates have an acceptable correlation with volatile acidity, which is itself correlated with quality. Citric acid also qualifies.

Volatile acidity decreases as sulphates increase.

Citric acid increases with sulphates.

Volatile Acidity Correlation

As with sulphates, citric acid is correlated with volatile acidity.

Citric acid tends to decrease as volatile acidity increases.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I have found only three variables which are correlated with quality:

  • Alcohol
  • Sulphates
  • Volatile acidity

Volatile acidity was described in the associated text file as something which can lead to an unpleasant, vinegar taste, so I am not surprised by its correlation. The boxplot clearly shows that it is a variable that should be kept low for a good wine.

On the other hand, sulphates and alcohol seem to improve the quality of the wine.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I wanted to explore these three variables to discover new correlations, and I found a trio of correlated variables: citric acid, sulphates and volatile acidity.

As expected from the description in the text file, alcohol is correlated with density. However, I am surprised that residual sugar has no impact on it.

What was the strongest relationship you found?

The strongest relationship I found was between citric acid and volatile acidity.

Multivariate Plots Section

The combination of volatile acidity and alcohol seems to be a good classifier. We can clearly observe darker points (which indicate higher quality) in the bottom right of the plot. Therefore high alcohol combined with low volatile acidity is a good attribute for a wine.

Once again, a good classifier, this time using alcohol and density. It is even better than the previous one. Density, like volatile acidity, has to be kept low.

As volatile acidity and citric acid are correlated variables, I expected that pH would affect the quality of the wine. This plot shows that it is not the case: wines of every quality level can have a pH between 3 and 4.

This plot shows that sulphates have a strong impact on the quality of the wines and are present at higher levels in the better ones.

This plot is a good synthesis and I will keep it for the final plots. It might be clearer if I grouped the quality ratings. It shows the three variables which directly affect the quality of wines.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The Volatile acidity vs Alcohol and Density vs Alcohol plots show very well that a classification is possible with these variables alone.

The graphic representation with the three best correlated variables for quality speaks for itself: high sulphates and alcohol and low volatile acidity produce the best wines.

Were there any interesting or surprising interactions between features?

I expected that pH could play a role in the wine taste, but the plots show that wines are more or less well balanced.


Final Plots and Summary

Plot One

Description One

As the associated text file explains, volatile acidity has a direct impact on the quality of the wine. As volatile acidity decreases, wine quality increases.

Plot Two

Description Two

This plot is really interesting as it can almost be used as a classifier. It shows how alcohol (and its correlated variable, density) has a strong effect on wine quality (low density and high alcohol).

Plot Three

Description Three

By far my favourite visualisation as it shows the combined action of the three correlated variables and can give a simple rule for wine making (low volatile acidity, high alcohol and enough sulphates).
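The simple rule described above can be written as a small check. The report gives no numeric cutoffs, so this Python sketch uses the medians from the untrimmed summary table (alcohol 10.20, volatile acidity 0.52, sulphates 0.62) as thresholds. That choice is an assumption made for illustration, as are the constant and function names.

```python
# Medians from the first summary table; using them as cutoffs is an assumption
ALCOHOL_MEDIAN = 10.20
VOLATILE_ACIDITY_MEDIAN = 0.52
SULPHATES_MEDIAN = 0.62


def matches_rule(wine):
    """True when a wine has high alcohol, low volatile acidity and enough sulphates."""
    return (
        wine["alcohol"] > ALCOHOL_MEDIAN
        and wine["volatile.acidity"] < VOLATILE_ACIDITY_MEDIAN
        and wine["sulphates"] > SULPHATES_MEDIAN
    )


# First wine from the structure output above (quality 5)
first_wine = {"alcohol": 9.4, "volatile.acidity": 0.7, "sulphates": 0.56}
first_wine_matches = matches_rule(first_wine)
# first_wine_matches is False
```

A rule like this only summarises a visual trend. It is not a validated classifier, and some wines that satisfy it will still have an average rating.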


Reflection

This dataset contains 1,599 observations about red wines and their chemical properties. The variable I tried to explain, and which I assumed was the most interesting one, was the quality rating.

I faced a first issue just by reading the text file description. The quality variable is a rating given by wine tasters, “expert wine tasters” to be more precise, but it is still a subjective opinion. For this reason I expected to find fewer correlated variables.

This was indeed the case: I found only three variables correlated with quality (alcohol, volatile acidity and sulphates) and one hidden variable that is not directly correlated with quality (citric acid).

Because there are few correlated variables, and because the explained variable is subjective, wine appears to be complex and probably deserves a larger survey with ratings in several categories instead of a single, rounded global rating.